Statement¶

The insurance company Protect Your Tomorrow wants to solve some tasks with the help of machine learning and you need to evaluate the possibility of doing so.

  • Task 1: Find customers similar to a particular customer. This will help the company's agents with marketing tasks.
  • Task 2: Predict whether a new customer is likely to receive an insurance payout. Can a prediction model be better than a dummy model?
  • Task 3: Predict the number of insurance payments a new customer is likely to receive using a linear regression model.
  • Task 4: Protect customers' personal data without spoiling the model from the previous task. It is necessary to develop a data transformation algorithm that would make it difficult to recover personal information if the data fell into the wrong hands. This is called data masking or data obfuscation. But the data must be protected in such a way that the quality of the machine learning models does not deteriorate. You don't have to choose the best model, just prove that the algorithm works correctly.

Data pre-processing & exploration¶

Initialization¶

In [1]:
pip install scikit-learn --upgrade
Requirement already satisfied: scikit-learn in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (1.3.0)
Requirement already satisfied: numpy>=1.17.3 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.25.2)
Requirement already satisfied: scipy>=1.5.0 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.11.2)
Requirement already satisfied: joblib>=1.1.1 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\guilherme\appdata\local\programs\python\python311\lib\site-packages (from scikit-learn) (3.2.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.metrics import r2_score
import seaborn as sns
import math
import sklearn.metrics
import sklearn.linear_model
import sklearn.metrics
import sklearn.neighbors
import sklearn.preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split
from scipy.spatial import distance
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display

Loading Data¶

Load the data and do a basic check that it is free of obvious problems.

In [3]:
df = pd.read_csv("C:\\Users\\Guilherme\\Downloads\\insurance_us.csv")

We've renamed the columns to make the code more consistent with your style.

In [4]:
df = df.rename(columns={'Gender': 'gender', 'Age': 'age', 'Salary': 'income', 'Family members': 'family_members', 'Insurance benefits': 'insurance_benefits'})
In [5]:
df.sample(10)
Out[5]:
gender age income family_members insurance_benefits
4651 0 28.0 33300.0 2 0
3712 1 23.0 49600.0 0 0
2199 0 28.0 29400.0 3 0
333 0 32.0 25600.0 1 0
1289 0 21.0 54800.0 1 0
2567 0 47.0 32800.0 0 1
4216 0 30.0 60400.0 1 0
248 0 31.0 49400.0 1 0
530 1 36.0 21200.0 1 0
3623 0 28.0 39300.0 0 0
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              5000 non-null   int64  
 1   age                 5000 non-null   float64
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64  
 4   insurance_benefits  5000 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 195.4 KB
In [7]:
df['age'] = df['age'].astype('int')
In [8]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   gender              5000 non-null   int64  
 1   age                 5000 non-null   int32  
 2   income              5000 non-null   float64
 3   family_members      5000 non-null   int64  
 4   insurance_benefits  5000 non-null   int64  
dtypes: float64(1), int32(1), int64(3)
memory usage: 175.9 KB
In [9]:
df.describe()
Out[9]:
gender age income family_members insurance_benefits
count 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean 0.499000 30.952800 39916.360000 1.194200 0.148000
std 0.500049 8.440807 9900.083569 1.091387 0.463183
min 0.000000 18.000000 5300.000000 0.000000 0.000000
25% 0.000000 24.000000 33300.000000 0.000000 0.000000
50% 0.000000 30.000000 40200.000000 1.000000 0.000000
75% 1.000000 37.000000 46600.000000 2.000000 0.000000
max 1.000000 65.000000 79000.000000 6.000000 5.000000
  • Apparently the insurance_benefits column has many outliers
In [10]:
df.duplicated().sum()
Out[10]:
153
In [11]:
df =  df.drop_duplicates().reset_index(drop=True)
In [12]:
df.duplicated().sum()
Out[12]:
0
  • I used drop_duplicated because 153 rows are insignificant in a 5000 row dataframe.

AED¶

Let's quickly check whether there are certain groups of customers by looking at the pair chart.

In [13]:
g = sns.pairplot(df, kind='hist')
g.fig.set_size_inches(12, 12)
C:\Users\Guilherme\AppData\Local\Programs\Python\Python311\Lib\site-packages\seaborn\axisgrid.py:118: UserWarning: The figure layout has changed to tight
  self._figure.tight_layout(*args, **kwargs)
In [14]:
px.scatter_3d(df,x='age',y='income',z='insurance_benefits', color = 'family_members')
  • We can see a relationship in which the group that receives the most insurance benefits is the older group over 40, with a closer relationship towards the 'average' with income, as age increases those selected are even more in the middle. And with few family members. Obviously taking gender equality into account.
In [15]:
df['gender'].value_counts()
Out[15]:
gender
0    2431
1    2416
Name: count, dtype: int64
  • This is confirmed

Okay, it's a bit difficult to identify obvious groups (clusters), because it's hard to combine several variables simultaneously (to analyze multivariate distributions). This is where Linear Algebra and Machine Learning can be very useful.

Task 1. Similar customers¶

Na linguagem de AM, é necessário desenvolver um procedimento que retorne k vizinhos mais próximos (objetos) para um determinado objeto com base na distância entre os objetos. Você pode querer rever as seguintes lições (capítulo -> lição)- Distância Entre Vetores -> Distância Euclidiana

  • Distância Entre Vetores -> Distância de Manhattan

Para resolver a tarefa, podemos tentar diferentes métricas de distância.

Escreva uma função que retorne k vizinhos mais próximos para um n-ésimo objeto com base em uma métrica de distância especificada. O número de pagamentos de seguro recebidos não deve ser levado em consideração para esta tarefa.

Você pode usar uma implementação pronta do algoritmo kNN do scikit-learn (verifique o link) ou usar a sua própria. Teste-o para quatro combinações de dois casos

  • Escalabilidade
    • os dados não são escalados
    • os dados escalados com o escalonador MaxAbsScaler
  • Métricas de distância
    • Euclidiana
    • Manhattan

Responda às perguntas:

  • Os dados não escalados afetam o algoritmo kNN? Se sim, como isso acontece?

-Quão semelhantes são os resultados usando a métrica de distância de Manhattan (independentemente da escalabilidade)?

In [17]:
feature_names = ['gender', 'age', 'income', 'family_members']
In [26]:
def get_knn(df, n, k, metric):
    
  
    
    if metric == 'euclidean':
        nbrs = NearestNeighbors(n_neighbors=k, metric='euclidean')
    elif metric == 'manhattan':
        nbrs = NearestNeighbors(n_neighbors=k, metric='manhattan')
    else:
        raise ValueError("Metric not supported. Choose 'euclidean' or 'manhattan'")

   
    nbrs = nbrs.fit(df[feature_names])
   
    nbrs_distances, nbrs_indices = nbrs.kneighbors([df.iloc[n][feature_names]], k, return_distance=True)
    
    df_res = pd.concat([
        df.iloc[nbrs_indices[0]], 
        pd.DataFrame(nbrs_distances.T, index=nbrs_indices[0], columns=['distance'])
        ], axis=1)
    
    return df_res

Scaling the data

In [20]:
feature_names = ['gender', 'age', 'income', 'family_members']

transformer_mas = sklearn.preprocessing.MaxAbsScaler().fit(df[feature_names].to_numpy())

df_scaled = df.copy()
df_scaled.loc[:, feature_names] = transformer_mas.transform(df[feature_names].to_numpy())
In [21]:
df_scaled.sample(5)
Out[21]:
gender age income family_members insurance_benefits
975 1 0.784615 0.367089 0.333333 2
2633 0 0.538462 0.586076 0.166667 0
2840 0 0.430769 0.300000 0.333333 0
2362 1 0.461538 0.603797 0.000000 0
2571 1 0.353846 0.626582 0.166667 0

Now, let's get similar records for a given record for each combination

In [22]:
get_knn(df, 1, 10, 'manhattan')
C:\Users\Guilherme\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:464: UserWarning:

X does not have valid feature names, but NearestNeighbors was fitted with feature names

Out[22]:
gender age income family_members insurance_benefits distance
1 0 46 38000.0 1 1 0.0
3810 0 40 38000.0 0 0 7.0
4796 1 37 38000.0 1 0 10.0
2480 1 36 38000.0 0 0 12.0
3498 0 33 38000.0 0 0 14.0
3760 0 32 38000.0 0 0 15.0
3437 0 27 38000.0 0 0 20.0
1876 1 26 38000.0 2 0 22.0
123 1 27 38000.0 5 0 24.0
1973 1 22 38000.0 1 0 25.0
In [23]:
get_knn(df, 1, 10, 'euclidean')
C:\Users\Guilherme\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:464: UserWarning:

X does not have valid feature names, but NearestNeighbors was fitted with feature names

Out[23]:
gender age income family_members insurance_benefits distance
1 0 46 38000.0 1 1 0.000000
3810 0 40 38000.0 0 0 6.082763
4796 1 37 38000.0 1 0 9.055385
2480 1 36 38000.0 0 0 10.099505
3498 0 33 38000.0 0 0 13.038405
3760 0 32 38000.0 0 0 14.035669
3437 0 27 38000.0 0 0 19.026298
123 1 27 38000.0 5 0 19.442222
1876 1 26 38000.0 2 0 20.049938
1973 1 22 38000.0 1 0 24.020824
In [24]:
get_knn(df_scaled, 1, 10, 'euclidean')
C:\Users\Guilherme\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:464: UserWarning:

X does not have valid feature names, but NearestNeighbors was fitted with feature names

Out[24]:
gender age income family_members insurance_benefits distance
1 0 0.707692 0.481013 0.166667 1 0.000000
4041 0 0.707692 0.477215 0.166667 1 0.003797
1835 0 0.707692 0.492405 0.166667 1 0.011392
4833 0 0.723077 0.491139 0.166667 1 0.018418
4341 0 0.692308 0.459494 0.166667 1 0.026453
2394 0 0.676923 0.482278 0.166667 1 0.030795
1634 0 0.676923 0.486076 0.166667 1 0.031183
3574 0 0.676923 0.470886 0.166667 1 0.032393
1650 0 0.738462 0.460759 0.166667 1 0.036837
4629 0 0.692308 0.445570 0.166667 1 0.038638
In [25]:
get_knn(df_scaled, 1, 10, 'manhattan')
C:\Users\Guilherme\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py:464: UserWarning:

X does not have valid feature names, but NearestNeighbors was fitted with feature names

Out[25]:
gender age income family_members insurance_benefits distance
1 0 0.707692 0.481013 0.166667 1 0.000000
4041 0 0.707692 0.477215 0.166667 1 0.003797
1835 0 0.707692 0.492405 0.166667 1 0.011392
4833 0 0.723077 0.491139 0.166667 1 0.025511
2394 0 0.676923 0.482278 0.166667 1 0.032035
1634 0 0.676923 0.486076 0.166667 1 0.035833
4341 0 0.692308 0.459494 0.166667 1 0.036904
55 0 0.707692 0.441772 0.166667 1 0.039241
3574 0 0.676923 0.470886 0.166667 1 0.040896
3462 0 0.707692 0.439241 0.166667 1 0.041772

Does unscaled data affect the kNN algorithm? If so, how does this happen?

Yes, it does, making the calculated distance larger.

Differences in scale can lead to a greater influence of some characteristics over others. For example, if a feature has much larger values compared to others, it will dominate the distances calculated by kNN, even if other features are more relevant to the task.

How similar are the results using the Manhattan distance metric (regardless of scalability)?

Quite similar, but larger than the Euclidean distance

Task 2. Is the customer likely to receive an insurance payout?¶

In terms of machine learning, we can look at this as a binary classification task.

With insurance payouts being more than zero as a target, assess whether the kNN classification approach might be better than a dummy model.

Instructions:

  • Build a kNN-based classifier and measure its quality with the F1 metric for k=1..10 for both the original and scaled data. It would be interesting to see how k can influence the evaluation metric and whether the scalability of the data makes any difference. You can use a ready-made implementation of scikit-learn's kNN classification algorithm (check the link) or use your own.
  • Build the dummy model, which is random for this case. It should return the value "1" with some probability. LVamos test the model with four probability values: 0, the probability of making any insurance payment, 0.5, 1.

The probability of making any insurance payment can be defined as

$$ P\{\text{insurance payment received}=number of clients who received any insurance payment}}{\text{total number of clients}}. $$

Divide the whole data in the ratio 70:30 for the training/testing parts.

In [27]:
def update(insurance_benefits):
    if insurance_benefits >= 1:
        return 1
    else:
        return 0

df['insurance_benefits_received'] = df['insurance_benefits'].apply(update)
df_scaled['insurance_benefits_received'] = df_scaled['insurance_benefits'].apply(update)
In [29]:
display(df['insurance_benefits_received'].value_counts())
df_scaled['insurance_benefits_received'].value_counts()
insurance_benefits_received
0    4284
1     563
Name: count, dtype: int64
Out[29]:
insurance_benefits_received
0    4284
1     563
Name: count, dtype: int64
In [30]:
df_train, df_test = train_test_split(df, test_size=0.3, random_state=12345)

df_train2, df_test2 = train_test_split(df_scaled, test_size=0.3, random_state=12345)
In [32]:
features_train = df_train.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_train = df_train['insurance_benefits_received']

features_test = df_test.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_test = df_test['insurance_benefits_received']

features_train2 = df_train2.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_train2 = df_train2['insurance_benefits_received']

features_test2 = df_test2.drop(['insurance_benefits_received','insurance_benefits'], axis=1)
target_test2 = df_test2['insurance_benefits_received'] 


print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)


print(features_train2.shape)
print(target_train2.shape)
print(features_test2.shape)
print(target_test2.shape)
(3392, 4)
(3392,)
(1455, 4)
(1455,)
(3392, 4)
(3392,)
(1455, 4)
(1455,)
In [33]:
for i in range(1,11):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(features_train, target_train)
    test_predictions = model.predict(features_test)

    model2 = KNeighborsClassifier(n_neighbors=i)
    model2.fit(features_train2, target_train2)
    test_predictions2 = model2.predict(features_test2)
In [34]:
def eval_classifier(y_true, y_pred):
    
    f1_score = sklearn.metrics.f1_score(y_true, y_pred)
    print(f'F1: {f1_score:.2f}')
    
    cm = sklearn.metrics.confusion_matrix(y_true, y_pred, normalize='all')
    print('Matriz de Confusão')
    print(cm)
In [36]:
# generating random model output

def rnd_model_predict(P, size, seed=42):

    rng = np.random.default_rng(seed=seed)
    return rng.binomial(n=1, p=P, size=size)
In [37]:
for P in [0, df['insurance_benefits_received'].sum() / len(df), 0.5, 1]:

    print(f'A probabilidade: {P:.2f}')
    y_pred_rnd = rnd_model_predict(P, size=len(df['insurance_benefits_received']))
        
    eval_classifier(df['insurance_benefits_received'], y_pred_rnd)
    
    print()
A probabilidade: 0.00
F1: 0.00
Matriz de Confusão
[[0.88384568 0.        ]
 [0.11615432 0.        ]]

A probabilidade: 0.12
F1: 0.13
Matriz de Confusão
[[0.78502166 0.09882401]
 [0.1017124  0.01444192]]

A probabilidade: 0.50
F1: 0.19
Matriz de Confusão
[[0.44873117 0.4351145 ]
 [0.05921188 0.05694244]]

A probabilidade: 1.00
F1: 0.21
Matriz de Confusão
[[0.         0.88384568]
 [0.         0.11615432]]

In [38]:
for i in range(1,11):
    print('k=',i)
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(features_train, target_train)
    test_predictions = model.predict(features_test)
    eval_classifier(target_test, test_predictions)
    print('k scaled=',i)
    model2 = KNeighborsClassifier(n_neighbors=i)
    model2.fit(features_train2, target_train2)
    test_predictions2 = model2.predict(features_test2)
    eval_classifier(target_test2, test_predictions2)
    
k= 1
F1: 0.67
Matriz de Confusão
[[0.86323024 0.02130584]
 [0.04604811 0.06941581]]
k scaled= 1
F1: 0.93
Matriz de Confusão
[[0.87972509 0.004811  ]
 [0.01168385 0.10378007]]
k= 2
F1: 0.38
Matriz de Confusão
[[0.87972509 0.004811  ]
 [0.08728522 0.02817869]]
k scaled= 2
F1: 0.89
Matriz de Confusão
[[0.88247423 0.00206186]
 [0.02199313 0.09347079]]
k= 3
F1: 0.38
Matriz de Confusão
[[0.87079038 0.0137457 ]
 [0.08522337 0.03024055]]
k scaled= 3
F1: 0.91
Matriz de Confusão
[[0.88041237 0.00412371]
 [0.01649485 0.09896907]]
k= 4
F1: 0.18
Matriz de Confusão
[[0.8790378  0.00549828]
 [0.10378007 0.01168385]]
k scaled= 4
F1: 0.88
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.0233677  0.09209622]]
k= 5
F1: 0.23
Matriz de Confusão
[[0.87766323 0.00687285]
 [0.09965636 0.01580756]]
k scaled= 5
F1: 0.89
Matriz de Confusão
[[0.87972509 0.004811  ]
 [0.0185567  0.09690722]]
k= 6
F1: 0.06
Matriz de Confusão
[[8.83848797e-01 6.87285223e-04]
 [1.12027491e-01 3.43642612e-03]]
k scaled= 6
F1: 0.87
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.02405498 0.09140893]]
k= 7
F1: 0.07
Matriz de Confusão
[[0.88316151 0.00137457]
 [0.11134021 0.00412371]]
k scaled= 7
F1: 0.90
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.01924399 0.09621993]]
k= 8
F1: 0.00
Matriz de Confusão
[[0.88453608 0.        ]
 [0.11546392 0.        ]]
k scaled= 8
F1: 0.86
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.02542955 0.09003436]]
k= 9
F1: 0.01
Matriz de Confusão
[[8.84536082e-01 0.00000000e+00]
 [1.14776632e-01 6.87285223e-04]]
k scaled= 9
F1: 0.87
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.02405498 0.09140893]]
k= 10
F1: 0.00
Matriz de Confusão
[[0.88453608 0.        ]
 [0.11546392 0.        ]]
k scaled= 10
F1: 0.86
Matriz de Confusão
[[0.88178694 0.00274914]
 [0.02542955 0.09003436]]
  • The highest f1 value found was with parameter k=1 with scaled data.

Task 3. Regression (with Linear Regression)¶

With insurance payments as the objective, evaluate what the REQM would be for a Linear Regression model.

Build your own implementation of Linear Regression. To do this, remember how the solution to the linear regression task is formulated in terms of linear algebra. Check the REQM for the original and scaled data. Can you see any difference in the REQM between these two cases?

Let's denote

  • $X$ - matrix of characteristics, each row is a case, each column is a characteristic, the first column consists of units
  • $y$ - goal (a vector)
  • $\hat{y}$ - estimated goal (a vector) - $w$ - weight vector

The linear regression task in matrix language can be formulated as $$ y = Xw $$

The training objective, then, is to find the $w$ that would minimize the L2 distance (EQM) between $Xw$ and $y$:

$$ \min_w d_2(Xw, y) \quad \text{or} \quad \min_w \text{MSE}(Xw, y) $$

It seems that there is an analytical solution to the above question:

$$ w = (X^T X)^{-1} X^T y $$

The above formula can be used to find the weights $w$ and the latter can be used to calculate predicted values

$$ \hat{y} = X_{val}w $$

Divide all the data in a 70:30 ratio for the training/validation parts. Use the REQM metric for model evaluation.

In [39]:
class MyLinearRegression:
    
    def __init__(self):
        self.weights = None
    
    def fit(self, X, y):
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        
        self.weights = np.linalg.inv(X2.T @ X2) @ X2.T @ y

    def predict(self, X):
        X2 = np.append(np.ones([len(X), 1]), X, axis=1)
        
        y_pred = X2 @ self.weights
        
        return y_pred
In [40]:
def eval_regressor(y_true, y_pred):
    
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'RMSE: {rmse:.2f}')
    
    r2_score = math.sqrt(sklearn.metrics.r2_score(y_true, y_pred))
    print(f'R2: {r2_score:.2f}')    
In [41]:
Xx = df[['age', 'gender', 'income', 'family_members']].to_numpy()
y = df['insurance_benefits'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(Xx, y, test_size=0.3, random_state=12345)

lr = MyLinearRegression()

lr.fit(X_train, y_train)
print(lr.weights)

y_test_pred = lr.predict(X_test)
eval_regressor(y_test, y_test_pred)
[-9.77366729e-01  3.58042291e-02  1.95594888e-02  5.85336165e-07
 -1.21618420e-02]
RMSE: 0.36
R2: 0.66
In [42]:
Xxx = df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
yx = df_scaled['insurance_benefits'].to_numpy()

X_trainx, X_testx, y_trainx, y_testx = train_test_split(Xxx, yx, test_size=0.3, random_state=12345)

lrx = MyLinearRegression()

lrx.fit(X_trainx, y_trainx)
print(lrx.weights)

y_test_predx = lrx.predict(X_testx)
eval_regressor(y_testx, y_test_predx)
[-0.97736673  2.32727489  0.01955949  0.04624156 -0.07297105]
RMSE: 0.36
R2: 0.66
  • No difference in the RMSE in these two cases.

Task 4. Obfuscating data¶

It is better to obfuscate the data by multiplying the numerical characteristics (remember, they can be seen as the matrix $X$) by an invertible matrix $P$.

$$ X' = X \times P $$

Try this and check how the values of the characteristics will look after the transformation. By the way, invertibility is important here, so make sure that $P$ is really invertible.

You may want to review the lesson 'Matrices and Matrix Operations -> Matrix Multiplication' to recall the matrix multiplication rule and its implementation with NumPy.

In [43]:
personal_info_column_list = ['gender', 'age', 'income', 'family_members']
df_pn = df[personal_info_column_list]
In [44]:
X = df_pn.to_numpy()
X
Out[44]:
array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
       [0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
       [0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
       ...,
       [0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
       [1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
       [1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])

Generating a random $P$ matrix.

In [45]:
rng = np.random.default_rng(seed=42)
P = rng.random(size=(X.shape[1], X.shape[1]))

Checking if the matrix $P$ is invertible

In [47]:
p1=np.linalg.inv(P)
p1
Out[47]:
array([[ 0.41467992, -1.43783972,  0.62798546,  1.14001268],
       [-1.06101789,  0.44219337,  0.1329549 ,  1.18425933],
       [ 1.42362442,  1.60461607, -2.0553823 , -1.53699695],
       [-0.11128575, -0.65813802,  1.74995517, -0.11816316]])

Can you guess the age or income of the customers after the transformation?

  • No
In [48]:
X1=X @ P
X1
Out[48]:
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
       [ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
       [ 2693.11742928,  9486.397744  ,  7808.83156024, 19484.86063067],
       ...,
       [ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
       [ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
       [ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])

Can you recover the original data from $X′$ if you know $P$? Try checking this with calculations by moving $P$ from the right side of the formula above to the left. The rules of matrix multiplication are really useful here

In [49]:
XX=X1 @ p1
XX
Out[49]:
array([[ 1.00000000e+00,  4.10000000e+01,  4.96000000e+04,
         1.00000000e+00],
       [ 1.67952800e-12,  4.60000000e+01,  3.80000000e+04,
         1.00000000e+00],
       [-6.23021448e-13,  2.90000000e+01,  2.10000000e+04,
        -2.03032656e-13],
       ...,
       [ 1.57996161e-12,  2.00000000e+01,  3.39000000e+04,
         2.00000000e+00],
       [ 1.00000000e+00,  2.20000000e+01,  3.27000000e+04,
         3.00000000e+00],
       [ 1.00000000e+00,  2.80000000e+01,  4.06000000e+04,
         1.00000000e+00]])

Print all three cases for some customers- The original data

  • The transformed
  • The inverted (recovered)
In [50]:
display(X,X1,XX)
array([[1.00e+00, 4.10e+01, 4.96e+04, 1.00e+00],
       [0.00e+00, 4.60e+01, 3.80e+04, 1.00e+00],
       [0.00e+00, 2.90e+01, 2.10e+04, 0.00e+00],
       ...,
       [0.00e+00, 2.00e+01, 3.39e+04, 2.00e+00],
       [1.00e+00, 2.20e+01, 3.27e+04, 3.00e+00],
       [1.00e+00, 2.80e+01, 4.06e+04, 1.00e+00]])
array([[ 6359.71527314, 22380.40467609, 18424.09074184, 46000.69669016],
       [ 4873.29406479, 17160.36702982, 14125.78076133, 35253.45577301],
       [ 2693.11742928,  9486.397744  ,  7808.83156024, 19484.86063067],
       ...,
       [ 4346.2234249 , 15289.24126492, 12586.16264392, 31433.50888552],
       [ 4194.09324155, 14751.9910242 , 12144.02930637, 30323.88763426],
       [ 5205.46827354, 18314.24814446, 15077.01370762, 37649.59295455]])
array([[ 1.00000000e+00,  4.10000000e+01,  4.96000000e+04,
         1.00000000e+00],
       [ 1.67952800e-12,  4.60000000e+01,  3.80000000e+04,
         1.00000000e+00],
       [-6.23021448e-13,  2.90000000e+01,  2.10000000e+04,
        -2.03032656e-13],
       ...,
       [ 1.57996161e-12,  2.00000000e+01,  3.39000000e+04,
         2.00000000e+00],
       [ 1.00000000e+00,  2.20000000e+01,  3.27000000e+04,
         3.00000000e+00],
       [ 1.00000000e+00,  2.80000000e+01,  4.06000000e+04,
         1.00000000e+00]])

You can probably see that some values are not exactly the same as the original data. What could be the reason for this?

  • I believe this is due to the nature of the transformations applied and the complexity of the algorithms involved. Because it is a random matrix in obfuscation and the inverse of it in recovery.

Proof that data obfuscation can work with Linear Regression¶

The regression task has been solved with linear regression in this project. Your next task is to prove analytically that the obfuscation method provided will not affect the linear regression in terms of predicted values, i.e. its values will remain the same. Do you believe that? Well, you don't have to believe it, you have to prove it!

Thus, the data is obfuscated and there is $X \ P$ instead of just X now. Consequently, there are other weights $w_P$ such as $$ w = (X^T X)^{-1} X^T y \quad \Rightarrow \quad w_P = [(XP)^T XP]^{-1} (XP)^T y $$

How would $w$ and $w_P$ be linked if you simplified the formula for $w_P$ above?

What would be the predicted values with $w_P$?

What does this mean for the quality of the linear regression if you measure with REQM?

Check Appendix B Properties of Matrices at the end of the booklet. There are useful formulas there!

No code is needed in this section, just analytical explanation!

Answer

1- To simplify this formula, we can use the property of matrices which says that $(AB)^T = B^TA^T$. Applying this, we can rewrite 𝑤𝑃 as $$ 𝑤𝑃=(𝑋𝑃)^{-1}(𝑋𝑃)^𝑇𝑦 $$

2- To calculate the predicted values with 𝑤𝑃, you multiply the matrix of characteristics 𝑋𝑃 by the weights 𝑤𝑃. In other words, the predicted 𝑦̂𝑃 values would be $$𝑦̂𝑃=𝑋𝑃𝑤𝑃$$.

3- The REQM (Root Mean Square Error) is a common quality measure for evaluating the effectiveness of a regression model. Therefore, if the REQM is lower, it means that the predicted values are closer to the actual values, which indicates a better quality of the linear regression.

Analytical test

$$ w=(X^{T}X)^{-1}X^{T}y $$ $$ w_P = [(XP)^{T}(XP)]^{-1}(XP)^{T}y $$$$ w_P = [(P^{T}X^{T})(XP)]^{-1}(XP)^{T}y, $$

Using the reversibility property of transposing a product of matrices, we can go from $(XP)^{T}$ to $(P^{T}X^{T})$.

After this step, we can apply the inverse in a matrix product along with the associative property of multiplication,

$$ w_P = [P^{T}(X^{T}X)P]^{-1}(XP)^{T}y = P^{-1}(X^{T}X)^{-1}(P^{T})^{-1}P^{T}(X^{T}y). $$

Before the term $(X^{T}y)$ we have the term $(P^{T})^{-1}P^{T}$, but by the multiplicative identity property, we have that:

$$ (P^{T})^{-1}P^{T} = I, $$

Continuing:

$$ w_P = P^{-1}(X^{T}X)^{-1}(P^{T})^{-1}P^{T}(X^{T}y) = P^{-1}(X^{T}X)^{-1}I(X^{T}y) = P^{-1}(X^{T}X)^{-1}(X^{T}y). $$

Well, if $w = (X^{T}X)^{-1}X^{T}y$, then:

$$ w_P = P^{-1}(X^{T}X)^{-1}(X^{T}y) = P^{-1}w, $$

then the unknown matrix $A$ is equal to $P^{-1}$ and:

$$ w_P = P^{-1}w. $$

To demonstrate analytically that data obfuscation does not affect linear regression, it is sufficient to substitute the value of 𝑤𝑃 in the following equation with the previously established relationship:

$$ \hat{y_P} = (X_{val}P)w_P = X_{val}PP^{-1}w, $$

by the multiplicative identity property, we have that:

$$ \hat{y_P} = X_{val}PP^{-1}w = X_{val}Iw = X_{val}w = \hat{y}. $$

Thus, we have shown that $\hat{y_P} = \hat{y}$, which implies that data obfuscation has no effect on the forecast made, although it can modify the estimated coefficients. As a result, since the forecast remains unchanged, no error metric will change.

In [51]:
def calculate_w(X, y):
    XTX_inv = np.linalg.inv(np.dot(X.T, X))
    w = np.dot(np.dot(XTX_inv, X.T), y)
    return w

def calculate_wP(XP, y):
    wP = np.dot(np.dot(np.linalg.inv(np.dot(XP.T, XP)), XP.T), y)
    return wP




XP = Xx + 1 

w = calculate_w(Xx, y)
wP = calculate_wP(XP, y)

print("w:", w)
print("wP:", wP)
w: [ 2.35998538e-02 -4.39473393e-02 -1.18490564e-05 -4.71380350e-02]
wP: [ 2.62638388e-02 -1.18679593e-01 -9.07489582e-06 -5.77014504e-02]
In [52]:
def predict_values(X, w):
    y_pred = np.dot(X, w)
    return y_pred

def predict_values_P(XP, wP):
    y_pred_P = np.dot(XP, wP)
    return y_pred_P

y_pred = predict_values(Xx, w)
y_pred_P = predict_values_P(XP, wP)

print("y_pred:", y_pred)
print("y_pred_P:", y_pred_P)
y_pred: [ 0.28879544  0.5881911   0.43556558 ... -0.023962   -0.0536288
  0.08863884]
y_pred_P: [ 0.30019523  0.65546281  0.42095223 ... -0.04789137 -0.16085487
  0.04043939]
In [53]:
def calculate_RMSE(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    return rmse

RMSE_w = calculate_RMSE(y, y_pred)
RMSE_wP = calculate_RMSE(y, y_pred_P)

print("RMSE for w:", RMSE_w)
print("RMSE for wP:", RMSE_wP)
RMSE for w: 0.39042934975680826
RMSE for wP: 0.3829889425603623

Linear regression test with data obfuscation¶

Agora, vamos provar que a Regressão Linear pode funcionar computacionalmente com a transformação de ofuscação escolhida. Crie um procedimento ou uma classe que execute a Regressão Linear opcionalmente com a ofuscação. Você pode usar uma implementação pronta de Regressão Linear do scikit-learn ou sua própria.

Execute a Regressão Linear para os dados originais e os ofuscados, compare os valores previstos e os valores da métrica $R^2$ do REQM. Há alguma diferença?

Procedure

  • Create a square matrix $P$ of random numbers.
  • Check that it is invertible. If not, repeat the first point until we get an invertible matrix.
  • <! your comment here!
  • Use $XP$ as the new matrix of characteristics
In [54]:
def ev(y_true,y_pred):
    rmse = math.sqrt(sklearn.metrics.mean_squared_error(y_true, y_pred))
    print(f'REQM: {rmse:.2f}')
In [55]:
G=df[['age', 'gender', 'income', 'family_members']].to_numpy()
In [56]:
B=df_scaled[['age', 'gender', 'income', 'family_members']].to_numpy()
In [57]:
rng = np.random.default_rng(seed=42)
Pw = rng.random(size=(G.shape[1], G.shape[1]))
Pw
Out[57]:
array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872]])
In [58]:
rng = np.random.default_rng(seed=42)
Pb = rng.random(size=(B.shape[1], B.shape[1]))
Pb
Out[58]:
array([[0.77395605, 0.43887844, 0.85859792, 0.69736803],
       [0.09417735, 0.97562235, 0.7611397 , 0.78606431],
       [0.12811363, 0.45038594, 0.37079802, 0.92676499],
       [0.64386512, 0.82276161, 0.4434142 , 0.22723872]])
In [59]:
oo = np.linalg.inv(Pw)
In [60]:
bb=np.linalg.inv(Pb)
In [61]:
mo=G@oo
mo
Out[61]:
array([[  70627.60073038,   79529.78987286, -101919.33167188,
         -76187.24189014],
       [  54116.69189337,   60908.61202825,  -78073.89004158,
         -58353.56151613],
       [  29908.13850608,   33655.24019185,  -43144.81668183,
         -32243.8754918 ],
       ...,
       [  48268.93881382,   54366.41182146,  -69661.40028605,
         -52081.63253157],
       [  46560.24656784,   52437.78090918,  -67191.80264736,
         -50233.89007523],
       [  57809.59012496,   65106.93712762,  -83429.05479992,
         -62369.08954376]])
In [62]:
mb=B@bb
mb
Out[62]:
array([[ 0.07582171,  0.43301381, -0.46973987,  0.91864977],
       [ 0.95969953, -0.35539713, -0.25258523,  0.04776935],
       [ 0.56344285, -0.21495421, -0.26618932,  0.10005223],
       ...,
       [ 0.70139563,  0.02677154, -0.10544848, -0.34816138],
       [-0.38703517,  0.2906601 ,  0.36970924,  0.87482819],
       [-0.16929933,  0.53777739, -0.36117949,  0.86574816]])
In [63]:
x1 = mo
y1 = G


X_train1, X_test1, y_train1, y_test1 = train_test_split(x1, y1, test_size=0.3, random_state=12345)

lr1 = MyLinearRegression()

lr1.fit(X_train1, y_train1)
print(lr1.weights)

ev(y_train1, lr1.predict(X_train1))
ev(y_test1, lr1.predict(X_test1))
[[-1.81888934e-04  2.94854481e-06 -2.58061376e-01 -3.81844513e-06]
 [ 7.73951588e-01  4.38874517e-01  8.58598326e-01  6.97365564e-01]
 [ 9.41682058e-02  9.75614505e-01  7.61134554e-01  7.86059092e-01]
 [ 1.28108868e-01  4.50381707e-01  3.70796065e-01  9.26762405e-01]
 [ 6.43857815e-01  8.22755447e-01  4.43410432e-01  2.27234451e-01]]
REQM: 0.09
REQM: 0.09
In [64]:
lr1.fit(X_test1, y_test1)
ev(y_train1, lr1.predict(X_train1))
ev(y_test1, lr1.predict(X_test1))
REQM: 0.10
REQM: 0.10
In [65]:
x2 = mb
y2 = B

X_train2, X_test2, y_train2, y_test2 = train_test_split(x2, y2, test_size=0.3, random_state=12345)

lr2 = MyLinearRegression()

lr2.fit(X_train2, y_train2)
print(lr2.weights)
ev(y_train2, lr2.predict(X_train2))
ev(y_test2, lr2.predict(X_test2))
[[ 4.61436445e-15 -3.74006381e-15  3.04617442e-15  5.03069808e-16]
 [ 7.73956049e-01  4.38878440e-01  8.58597920e-01  6.97368029e-01]
 [ 9.41773479e-02  9.75622352e-01  7.61139702e-01  7.86064305e-01]
 [ 1.28113633e-01  4.50385938e-01  3.70798024e-01  9.26764989e-01]
 [ 6.43865120e-01  8.22761613e-01  4.43414199e-01  2.27238722e-01]]
REQM: 0.00
REQM: 0.00
In [66]:
lr2.fit(X_test2, y_test2)
ev(y_train2, lr2.predict(X_train2))
ev(y_test2, lr2.predict(X_test2))
REQM: 0.00
REQM: 0.00

Conclusions¶

  • There are no very clear groups, but in a 3d graph you can get a better idea of how the dataframe behaves.
  • Data scaling improves the performance of Knn, and usually of all tests.
  • Proof that REQM does not change with data obfuscation has been provided.